Tomorrow's Rain Prediction¶

In the following notebook we evaluate two classifier models (Random Forest and Gradient Boosting) for predicting whether it is going to rain the following day.

When executed, the notebook pulls the data live from the ADLS account hosting the rain_prediction.parquet dataset. This dataset is updated daily by an automated data pipeline that ingests data from the OpenWeatherMaps API and transforms it until it lands, mostly clean, in the aforementioned dataset.

We start by importing the required libraries and creating a helper class for interacting with polars dataframes and plotly figures.

In [1]:
# Standard lib
import sys
from pathlib import Path

# External libs
import polars as pl

import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
from plotly.subplots import make_subplots

from dotenv import load_dotenv

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from imblearn.over_sampling import ADASYN

from matplotlib import pyplot as plt

# Custom lib
sys.path.insert(0, "../openweather-functions/")

from src.destinations.adls import ADLS

# Requires a ./.env file with the following variables in order to authenticate to ADLS
# AZURE_CLIENT_SECRET=""
# AZURE_CLIENT_ID=""
# AZURE_TENANT_ID=""
# AZURE_ACCOUNT_NAME=""
# AZURE_CONTAINER_NAME=""
_ = load_dotenv()

# Required for exporting the plotly figures into HTML
pio.renderers.default = "notebook"


class Label(str):
    """Helper class for pretty printing labels with plotly"""

    def __init__(self, label):
        super(Label, self).__init__()
        self.name = label
        self.pretty = label.replace("_", " ").capitalize()
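A quick illustration of how the helper behaves (the class is restated here so the snippet runs standalone; the `wind_speed` example value is arbitrary):

```python
# Quick sanity check of the Label helper (restated so the snippet is self-contained)
class Label(str):
    """Helper class for pretty printing labels with plotly"""

    def __init__(self, label):
        super().__init__()
        self.name = label
        self.pretty = label.replace("_", " ").capitalize()


label = Label("wind_speed")
print(label.pretty)  # Wind speed
print(label == "wind_speed")  # True: a Label still behaves as a plain str
```

Because `Label` subclasses `str`, instances can be passed anywhere a column name is expected (e.g. `df[label]`), while `label.pretty` gives the human-readable form for plot titles.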

Dataset exploration and preparation¶

Download the dataset and create a polars dataframe.

In [2]:
file_name = "rain_prediction.parquet"
adls = ADLS(directory="ml")

try:
    adls.download_file(file_name, file_name)
    df = pl.read_parquet(file_name)
finally:
    Path(file_name).unlink(missing_ok=True)

df.head()
ADLS: Using default envrionment credentials
ADLS: Initializing DataLakeServiceClient
Out[2]:
shape: (5, 15)
| recorded_day | location | clouds | temperature_feels_like | avg_temperature | max_temperature | min_temperature | humidity | pressure | degrees | wind_gusts | wind_speed | rain | max_rain | rain_prediction |
| date | str | f64 | f64 | f64 | f32 | f32 | f64 | f64 | f64 | f64 | f64 | f64 | f32 | str |
| 2024-12-31 | "valencia" | 75.0 | 9.059998 | 9.929993 | 10.899994 | 8.899994 | 93.0 | 1031.0 | 330.0 | null | 2.06 | 2.73 | 2.73 | "moderate" |
| 2025-01-01 | "valencia" | 70.0 | 10.171673 | 10.892921 | 14.480011 | 7.240021 | 88.083333 | 1029.583333 | 258.416667 | 2.817 | 2.203333 | 5.05 | 2.73 | "dry" |
| 2025-01-02 | "valencia" | 6.75 | 10.032922 | 11.140424 | 17.820007 | 5.470001 | 73.916667 | 1026.083333 | 257.75 | 3.716875 | 3.2575 | null | null | "light" |
| 2025-01-03 | "valencia" | 10.458333 | 11.807924 | 13.062091 | 19.450012 | 6.240021 | 54.291667 | 1021.916667 | 273.75 | 4.917143 | 3.880833 | 1.81 | 0.87 | "dry" |
| 2025-01-04 | "valencia" | 11.083333 | 12.952504 | 14.05084 | 20.040009 | 10.240021 | 55.208333 | 1019.958333 | 261.25 | 4.9725 | 3.672083 | null | null | "dry" |

The dataset consists of 15 columns. We will use 10 of them as predictors and derive the target from the rain_prediction column. The recorded_day and location columns are ignored, since at prediction time the model only needs the current day's measurements to predict the next day.

In [3]:
df = df.fill_null(0.0)
min_date = df["recorded_day"].min()
max_date = df["recorded_day"].max()
unique_locations = len(
    df.select(pl.col("location").str.replace_all("_", " ").str.to_uppercase()).unique()
)

print(
    f"Dataset used includes data from {min_date.isoformat()} to {max_date.isoformat()} "
    f"from {unique_locations} distinct locations for a total of {len(df)} entries."
)
Dataset used includes data from 2024-12-31 to 2025-12-12 from 17 distinct locations for a total of 6002 entries.

Plot the distributions of predictive variables plus the rain variables for visual inspection. Note that the rain variables are not going to be used as predictors, and that the predicted variable is going to be created from the rain_prediction column instead.

In [4]:
x_labels = [
    Label(x)
    for x in [
        "clouds",
        "temperature_feels_like",
        "avg_temperature",
        "max_temperature",
        "min_temperature",
        "humidity",
        "pressure",
        "degrees",
        "wind_gusts",
        "wind_speed",
    ]
]
plot_labels = x_labels + [Label("rain"), Label("max_rain")]

ROWS = 3
COLS = 4
fig = make_subplots(
    rows=ROWS, cols=COLS, subplot_titles=[label.pretty for label in plot_labels]
)

for idx, label in enumerate(plot_labels):
    fig.add_trace(
        go.Histogram(x=df[label], nbinsx=100),
        row=idx // COLS + 1,
        col=(idx % COLS) + 1,
    )

fig.update_layout(showlegend=False, height=500, width=1000)

fig.show()

Standardize predictor variables to generate the independent variable set.

In [5]:
X = df.select(
    (pl.col(label) - pl.col(label).mean()) / pl.col(label).std() for label in x_labels
)

X.head()
Out[5]:
shape: (5, 10)
| clouds | temperature_feels_like | avg_temperature | max_temperature | min_temperature | humidity | pressure | degrees | wind_gusts | wind_speed |
| f64 | f64 | f64 | f32 | f32 | f64 | f64 | f64 | f64 | f64 |
| 1.246957 | -0.996281 | -1.028313 | -1.383602 | -0.509039 | 1.477731 | 2.423094 | 1.999844 | -1.160838 | -0.672577 |
| 1.080721 | -0.844899 | -0.886313 | -0.939287 | -0.773078 | 1.171419 | 2.174604 | 1.038449 | -0.407171 | -0.5937 |
| -1.022164 | -0.863793 | -0.849815 | -0.524761 | -1.054621 | 0.288827 | 1.560687 | 1.029495 | -0.166416 | -0.013588 |
| -0.898872 | -0.622081 | -0.566433 | -0.322462 | -0.93214 | -0.933824 | 0.829834 | 1.244382 | 0.154707 | 0.329435 |
| -0.878093 | -0.466217 | -0.420625 | -0.249237 | -0.295892 | -0.876715 | 0.486332 | 1.076501 | 0.169517 | 0.214559 |

Generate the variable to predict. The rain_prediction column, which takes the values dry, light, moderate and intense, is collapsed into either dry or rain. This variable indicates whether it rains the following day and is what we are going to train our models to predict. Additionally, create a binary column with 0/1 values encoding the same information.

In [6]:
df = df.with_columns(
    pl.when(pl.col("rain_prediction").is_in(("light", "moderate", "intense")))
    .then(pl.lit("rain"))
    .otherwise(pl.lit("dry"))
    .alias("simple_rain_prediction"),
).with_columns(
    pl.when(pl.col("simple_rain_prediction") == pl.lit("rain"))
    .then(1)
    .otherwise(0)
    .alias("b_simple_rain_prediction")
)
y = df.select("b_simple_rain_prediction")

df.select("simple_rain_prediction", "b_simple_rain_prediction").head()
Out[6]:
shape: (5, 2)
| simple_rain_prediction | b_simple_rain_prediction |
| str | i32 |
| "rain" | 1 |
| "dry" | 0 |
| "rain" | 1 |
| "dry" | 0 |
| "dry" | 0 |

Evaluate the class balance of the dataset. The data is collected from different locations in Spain, so it is expected to be imbalanced towards dry days.

In [7]:
label_count = (
    df.select(pl.col("simple_rain_prediction"))
    .group_by(pl.col("simple_rain_prediction"))
    .agg(pl.col("simple_rain_prediction").count().alias("count"))
)

fig = px.bar(
    label_count,
    x="simple_rain_prediction",
    y="count",
    title="Label Distribution in Dataset Before Resampling",
    labels={"simple_rain_prediction": "Simple Rain Prediction", "count": "Count"},
    width=500,
    height=500
)

fig.show()

Split the dataset into train and test datasets (80/20).

Furthermore, use ADASYN to synthetically balance the training dataset.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# ADASYN expects a 1-D target array, so ravel the single-column frame
X_train_resampled, y_train_resampled = ADASYN().fit_resample(
    X_train.to_numpy(), y_train.to_numpy().ravel()
)

print(f"Total: {len(X)}")
print(f"Training: {len(y_train)}")
print(f"Resampled: {len(y_train_resampled)}")
print(f"Test: {len(y_test)}")
Total: 6002
Training: 4801
Resampled: 7338
Test: 1201

Plot again the count of dry and rain cases for the resampled training dataset.

In [9]:
label_count_resampled = pl.DataFrame({
    "simple_rain_prediction": ["dry", "rain"],
    "count": [y_train_resampled.size - y_train_resampled.sum(), y_train_resampled.sum()],
})

fig = px.bar(
    label_count_resampled,
    x="simple_rain_prediction",
    y="count",
    title="Label Distribution in Dataset After Resampling",
    labels={"simple_rain_prediction": "Simple Rain Prediction", "count": "Count"},
    width=500,
    height=500
)
fig.show()

Model training and evaluation¶

We create an evaluator factory with the obtained datasets. This will make it much easier to create evaluations with different models.

In [10]:
def get_model_evaluator(X_train, X_test, y_train, y_test):
    def evaluate_model(model, name):

        model = model.fit(X_train, y_train.to_numpy().ravel())
        confusion_mat = confusion_matrix(y_test, model.predict(X_test))

        tn, fp = confusion_mat[0]
        fn, tp = confusion_mat[1]
        total = confusion_mat.sum()

        p = tp + fn
        n = tn + fp

        tpr = sensitivity = tp / p
        tnr = specificity = tn / n
        precision = tp / (tp + fp)
        accuracy = (tp + tn) / total
        balanced_accuracy = (tpr + tnr) / 2

        metrics = {
            "accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
        }

        print(f"EVALUATION METRICS FOR '{name}'")
        print("===============================")
        print(f"| Accuracy          | {accuracy:.3%} |")
        print("===============================")
        print(f"| Balanced Accuracy | {balanced_accuracy:.3%} |")
        print("===============================")
        print(f"| Precision (PPV)   | {precision:.3%} | ")
        print("===============================")
        print(f"| Sensitivity (TPR) | {sensitivity:.3%} | ")
        print("===============================")
        print(f"| Specificity (TNR) | {specificity:.3%} |")
        print("===============================")

        fig, ax = plt.subplots(1, 1)
        disp = ConfusionMatrixDisplay(confusion_mat, display_labels=["dry", "rain"])
        disp.plot(ax=ax)
        _ = ax.set_title(name)

        return fig, ax, metrics

    return evaluate_model


model_evaluator = get_model_evaluator(X_train, X_test, y_train, y_test)
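The hand-computed formulas inside evaluate_model can be cross-checked on a toy confusion matrix against scikit-learn's own metric functions (the labels below are made up for illustration):

```python
# Cross-check the metric formulas from evaluate_model on toy labels
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy ground truth / predictions (0 = dry, 1 = rain)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)

# Same formulas as in evaluate_model above
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy={accuracy}, balanced_accuracy={balanced_accuracy}")
```

These values should match `accuracy_score`, `balanced_accuracy_score`, `precision_score` and `recall_score` for the same labels, confirming the manual derivation from the confusion matrix.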

Use the evaluator function to train and evaluate the two selected models.

First, the Random Forest.

In [11]:
_ = model_evaluator(RandomForestClassifier(n_estimators=100), "Random Forest")
EVALUATION METRICS FOR 'Random Forest'
===============================
| Accuracy          | 78.102% |
===============================
| Balanced Accuracy | 57.817% |
===============================
| Precision (PPV)   | 50.893% | 
===============================
| Sensitivity (TPR) | 21.509% | 
===============================
| Specificity (TNR) | 94.124% |
===============================
[Confusion matrix plot: Random Forest]

Second, the Gradient Boosting.

In [12]:
_ = model_evaluator(GradientBoostingClassifier(n_estimators=100), "Gradient Boosting")
EVALUATION METRICS FOR 'Gradient Boosting'
===============================
| Accuracy          | 78.601% |
===============================
| Balanced Accuracy | 57.731% |
===============================
| Precision (PPV)   | 54.000% | 
===============================
| Sensitivity (TPR) | 20.377% | 
===============================
| Specificity (TNR) | 95.085% |
===============================
[Confusion matrix plot: Gradient Boosting]

Results evaluation¶

Both models perform similarly: accuracy is around 78%, but balanced accuracy is only ~58%. The high specificity (~94-95%) combined with low sensitivity (~20-22%) shows that, despite the ADASYN resampling of the training set, both classifiers predict dry for the vast majority of days and miss most rain days. Precision around 51-54% means roughly half of the predicted rain days are correct; Gradient Boosting trades a little sensitivity for slightly better precision and specificity. Possible next steps include tuning the decision threshold or using class weights to improve rain-day recall.